This project explores shark incidents in California using a dataset provided by the California Department of Fish and Wildlife. The research questions focus on identifying patterns in the data rather than establishing causality. They include:
Which species of sharks are most commonly involved in incidents?
Is there a specific activity that is more prone to shark incidents?
What types of injuries are most common in shark incidents? If a shark attacks you, will it kill you for sure?
Is there a town or city where attacks are more common?
Is there a depth at which most incidents occur?
Is there a month or time when more incidents occur?
What percentage of people attend the beach per day, and what percentage of beach visitors bathe?
What is the number of incidents per capita?
Are sharks really a danger to humans?
Introduction
Shark incidents have long fascinated researchers and the public alike, given the potentially fatal nature of these encounters and their connection to popular beach activities. This project explores the characteristics of shark incidents in California, focusing on the severity of the injuries sustained (fatal, major, minor or none). We want to use the data to estimate the probability that a person attacked by a shark dies.
We also want to explore which activities are most common during attacks (scuba diving, swimming, surfing and others) and whether particular shark species are more likely to be involved. The study will provide valuable exploratory insights into the patterns of these incidents, which could help inform safety measures for coastal activities. By analyzing historical data on shark incidents, this research will help understand the frequency of attacks and potentially identify the environmental and behavioral factors that correlate with different injury types.
Data Sources
The primary data source for this project is the California Department of Fish and Wildlife’s dataset on shark incidents, last updated in March 2024. This dataset includes a wide range of variables, such as the date, time, location, water depth, type of human activity, species of shark involved, and the severity of injuries. The data is original and directly collected from incidents reported in California, ensuring its reliability. However, there may be some concerns related to incomplete data entries or inconsistencies in how certain variables were recorded. These issues will be addressed through data cleaning and preparation. For example, the depth and injury fields will need to be standardized, and any missing or ambiguous entries will be carefully handled to maintain the integrity of the analysis.
Link 4 is particularly important: it presents a study titled “Two-year migration of adult female white sharks (Carcharodon carcharias) reveals widely separated nursery areas and conservation concerns”, published in Animal Biotelemetry, which investigates the migratory patterns and ecological significance of adult female white sharks. Using satellite tracking data collected over two years, the research highlights the extensive movement and habitat usage of these apex predators.
Key findings of the study include:
Identification of Migration Routes: Adult female white sharks exhibit long-distance migratory behavior, connecting distinct geographic regions, including widely separated nursery areas.
Anticipated Results
In this analysis, we expect to examine two key variables: the mode of activity during the shark incident and the type of injury sustained. Based on the dataset, it is anticipated that activities like surfing and swimming will show a higher frequency of incidents. Additionally, injuries will likely range from minor to fatal, with surfing-related activities potentially resulting in more severe outcomes. We anticipate that these variables will be distributed in a way that shows many patterns, where certain activities dominate the data. For example, it’s expected that shark incidents are concentrated in shallow waters where swimming and surfing take place. We also expect to find relationships between shark species and injury severity, where species like the Great White Shark may be more commonly associated with fatal incidents.
The first chart illustrates the geographical distribution of shark incidents along the California coast, with a focus on the “Red Triangle” where incidents take place. This helps visualize the relationship between locations and incident frequency. The second chart shows shark incidents by activity type across different years, which helps to visualize the expected distribution of incidents across different activities and how this may change over time.
The distribution of activity types and injury severities will inform our understanding of which activities are most prone to shark incidents and how dangerous these incidents can be. The relationships between activity type and injury type will help answer research questions about what factors contribute to the likelihood and severity of shark-related injuries.
The main goal of this study is to demystify the danger that sharks pose to humans and to reassure those who still consider them a threat to society. We will do this with real data obtained from government databases, which we describe later.
Data Cleaning
Dataset 1 Cleaning (Sharks Incidents)
This dataset contains information on shark incidents in California.
First, let's load the data in R and look at the first 6 observations.
Code
head(data)
#> # A tibble: 6 × 14
#> IncidentNum Date Time County Location Mode Injury Depth
#> <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1950-10-08 00:00:00 0.5 San D… Imperia… Swim… major surf…
#> 2 2 1952-05-27 00:00:00 0.58333333… San D… Imperia… Swim… minor surf…
#> 3 3 1952-12-07 00:00:00 0.58333333… Monte… Lovers … Swim… fatal surf…
#> 4 4 1955-02-06 00:00:00 0.5 Monte… Pacific… Free… minor surf…
#> 5 5 1956-08-14 00:00:00 0.6875 San L… Pismo B… Swim… major surf…
#> 6 6 1957-04-28 00:00:00 0.5625 San L… Morro B… Swim… fatal surf…
#> # ℹ 6 more variables: Species <chr>, Comment <chr>, Longitude <chr>,
#> # Latitude <dbl>, `Confirmed Source` <chr>, `WFL Case #` <chr>
Code
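# Drop the last 9 rows, which carry no incident information
data <- data[1:(nrow(data) - 9), ]
# Drop columns 10 to 14, which are not used in the analysis
data <- data[, -c(10:14)]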
The last 9 rows do not provide any information, so they are removed. Columns 10 to 14 (Comment, Longitude, Latitude, Confirmed Source and WFL Case #) are dropped as well.
For further processing we convert the variables to the desired types. For example, R reads the variables Species, County and Mode as character, but for our purposes it is better to treat them as factors.
#> IncidentNum Date Time
#> Min. : 1.00 Min. :1950-10-08 00:00:00.000 Min. :0.2812
#> 1st Qu.: 53.25 1st Qu.:1985-03-14 06:00:00.000 1st Qu.:0.4167
#> Median :104.50 Median :2004-09-10 00:00:00.000 Median :0.5000
#> Mean :103.81 Mean :1997-12-23 18:10:41.583 Mean :0.5316
#> 3rd Qu.:154.75 3rd Qu.:2013-09-27 00:00:00.000 3rd Qu.:0.6380
#> Max. :205.00 Max. :2022-02-26 00:00:00.000 Max. :0.9583
#> NA's :18
#> County Location Mode
#> San Diego :23 Salmon Creek Beach : 9 Surfing / Boarding :80
#> Santa Barbara:19 Farallon Islands : 7 Freediving :35
#> Humboldt :18 Tomales Point : 7 Kayaking / Canoeing:29
#> San Mateo :18 Moonstone Beach : 5 Swimming :22
#> Marin :16 San Onofre State Beach: 5 Scuba Diving :19
#> Monterey :15 La Jolla : 4 Hookah Diving :10
#> (Other) :93 (Other) :165 (Other) : 7
#> Injury Depth Species number Year
#> fatal:15 Min. : 0.000 White :179 Min. : 1.00 Min. :1950
#> major:59 1st Qu.: 0.000 Unknown : 13 1st Qu.: 51.25 1st Qu.:1985
#> minor:49 Median : 0.000 Hammerhead: 3 Median :101.50 Median :2004
#> none :79 Mean : 3.322 Blue : 2 Mean :101.50 Mean :1997
#> 3rd Qu.: 0.000 Leopard : 2 3rd Qu.:151.75 3rd Qu.:2013
#> Max. :72.000 Salmon : 1 Max. :202.00 Max. :2022
#> NA's :3 (Other) : 2
#> Month Day
#> October :36 Length:202
#> August :31 Class :character
#> September:31 Mode :character
#> July :23
#> May :16
#> November :16
#> (Other) :49
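The type conversion described above could look like the following sketch. It uses a small made-up data frame (`toy`) rather than the real dataset, so the values are illustrative only; the column names follow the output shown earlier.

```r
# Toy stand-in for the incident data; values are illustrative only.
toy <- data.frame(
  County  = c("San Diego", "Monterey", "San Diego"),
  Mode    = c("Swimming", "Freediving", "Surfing / Boarding"),
  Injury  = c("major", "minor", "none"),
  Species = c("White", "White", "Unknown"),
  stringsAsFactors = FALSE
)

# Convert the categorical columns from character to factor.
factor_cols <- c("County", "Mode", "Injury", "Species")
toy[factor_cols] <- lapply(toy[factor_cols], factor)

# summary() now reports counts per level instead of Length/Class/Mode.
summary(toy)
```

With factors, `summary()` prints the most frequent levels of each variable, which is the form of the output above.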
Dataset 2 Cleaning (Beaches Attendance)
The county names in the data2 dataset differ in places from those in the data dataset. To compare the two datasets and combine their information, the county names need to match.
#> [1] Los Angeles Santa Cruz
#> [3] California State Parks San Diego
#> [5] Orange Alameda
#> [7] Santa Barbara Ventura
#> [9] East Bay Regional Parks District San Francisco
#> [11] Marin Humboldt
#> [13] San Luis Obispo Santa Clara
#> [15] Sonoma
#> 15 Levels: Alameda California State Parks ... Ventura
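One way to make the county names comparable is to normalise them before matching. The sketch below is an assumption about how this could be done, not the report's actual code: the helper `normalise_county` and the example labels are hypothetical, and lower-casing is chosen only because the joined table later in the report shows lower-case county names.

```r
# Hypothetical county labels; the real ones would come from data$County and data2$County.
incident_counties   <- c("San Diego", "Santa Barbara", "Island - Catalina", "Marin")
attendance_counties <- c("san diego ", "SANTA BARBARA", "Marin", "Alameda")

# Trim whitespace and lower-case so that equal counties compare equal.
normalise_county <- function(x) tolower(trimws(x))

inc <- normalise_county(incident_counties)
att <- normalise_county(attendance_counties)

shared        <- intersect(inc, att)  # counties usable for per-capita rates
only_incident <- setdiff(inc, att)    # incidents without attendance data
only_attend   <- setdiff(att, inc)    # attendance without recorded incidents
```

Listing `only_incident` and `only_attend` makes it explicit which counties drop out when the two datasets are combined.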
The information provided by these bar plots is revealing: white sharks are the only species present in fatal incidents and in those with major injuries, which suggests that it is the only species humans have reason to fear.
2. What is the relationship between the activity and the injury type?
To answer this question we created a bar plot for each type of injury, showing which activities are involved in each type of injury.
# Order activities by the total number of incidents
total_incidents_per_mode <- injury_mode_count %>%
  group_by(Mode) %>%
  summarise(total_incidents = sum(injury_count)) %>%
  arrange(desc(total_incidents))
injury_mode_count$Mode <- factor(injury_mode_count$Mode, levels = total_incidents_per_mode$Mode)

# Create the faceted bar chart with adjusted label positions
p <- ggplot(injury_mode_count, aes(x = Mode, y = injury_count, fill = Mode)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = injury_count), hjust = 2, size = 5) +
  coord_flip() +
  labs(title = "Injury Types by Activity", x = "Activity", y = "Number of Incidents") +
  scale_fill_viridis_d(option = "plasma", guide = "none") +
  facet_wrap(~ Injury, scales = "free_y") +
  expand_limits(y = max(injury_mode_count$injury_count) * 1.1) +
  theme_minimal() +
  theme(
    plot.title = element_text(color = "coral", face = "bold.italic"),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 8),
    strip.text = element_text(size = 12, face = "bold")
  )
ggplotly(p)
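The chart code relies on a summary table `injury_mode_count` with one row per activity and injury type, whose construction is not shown. A minimal base R sketch of how such a table could be built follows; the toy rows are made up, and only the column names `Mode`, `Injury` and `injury_count` are taken from the plotting code.

```r
# Toy incidents; in the report this would be the cleaned `data`.
toy <- data.frame(
  Mode   = c("Surfing / Boarding", "Surfing / Boarding", "Swimming",
             "Freediving", "Surfing / Boarding"),
  Injury = c("none", "major", "fatal", "minor", "none")
)

# One row per (Mode, Injury) pair with the number of incidents.
injury_mode_count <- as.data.frame(
  table(Mode = toy$Mode, Injury = toy$Injury),
  responseName = "injury_count",
  stringsAsFactors = FALSE
)

# Drop combinations that never occur so that no empty bars are drawn.
injury_mode_count <- injury_mode_count[injury_mode_count$injury_count > 0, ]
injury_mode_count
```

Because the report already uses dplyr, `count(Mode, Injury, name = "injury_count")` would be an equivalent alternative.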
We see that the activities with the most incidents take place at or near the surface: swimming, surfing and freediving. Scuba diving has comparatively few incidents.
3. What types of injuries are most common in shark incidents, if a shark attacks you… will it kill you for sure?
To answer this we created a waffle chart showing the proportion of each injury type. We focus on the number of fatal incidents and on the incidents that caused no injury.
We see that a shark attack does not always end in death: 15 of the 202 recorded incidents (about 7%) were fatal. In fact, 39% of the attacks caused no injury at all.
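The proportions behind this statement can be checked directly from the injury counts in the summary output shown earlier (fatal 15, major 59, minor 49, none 79). The following is a small sketch of that arithmetic, not the code used to draw the waffle chart.

```r
# Injury counts copied from the summary output of the cleaned data.
injury_counts <- c(fatal = 15, major = 59, minor = 49, none = 79)

# Share of each outcome among all recorded incidents, in percent.
injury_share <- round(100 * injury_counts / sum(injury_counts), 1)
injury_share
```

The `none` share rounds to 39.1%, which matches the 39% quoted in the text, and the `fatal` share is 7.4%.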
4. Is there a town or city where attacks are more common?
To answer this we create a contingency table of county by injury type and a bar chart of incidents per county.
Code
# Cross-tabulate incidents by county and injury type
Table <- table(data$County, data$Injury)
Table
#>
#> fatal major minor none
#> Del Norte 0 0 2 1
#> Humboldt 0 7 2 9
#> Island - Catalina 0 0 1 3
#> Island - Farallones 0 7 0 0
#> Island - San Miguel 1 2 2 0
#> Island - San Nicolas 0 0 1 0
#> Island - Santa Barbara 0 0 0 0
#> Island - Santa Cruz 0 0 1 1
#> Island - Santa Rosa 0 1 0 0
#> Los Angeles 1 0 6 2
#> Marin 0 9 4 3
#> Mendocino 1 3 1 0
#> Monterey 2 8 2 3
#> Orange 0 1 2 5
#> San Diego 2 4 8 9
#> San Francisco 1 0 0 1
#> San Luis Obispo 3 3 1 7
#> San Mateo 1 1 4 12
#> Santa Barbara 2 2 6 9
#> Santa Cruz 1 3 3 8
#> Sonoma 0 8 1 6
#> Ventura 0 0 2 0
Code
# Count incidents per county, most affected first
county_counts <- data %>%
  count(County) %>%
  arrange(desc(n))

# Horizontal bar chart of incidents per county
p <- ggplot(county_counts, aes(x = reorder(County, n), y = n, fill = County)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = n), hjust = 3, size = 5.5, color = "black") +
  coord_flip() +
  labs(title = "Shark Incidents per County", x = "County", y = "Number of Incidents") +
  theme_minimal() +
  theme(
    plot.title = element_text(color = "coral", size = 16, face = "bold.italic"),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    legend.position = "none"
  )
ggplotly(p)
Here we can see that the county with the largest number of incidents is San Diego, with 23 attacks recorded. San Francisco, for example, does not have many attacks (fewer than 5), and Los Angeles, despite its large population, has fewer than 10.
We see that the highest number of incidents, both fatal and non-fatal, were recorded for activities carried out at the surface. This suggests, for example, that diving is a comparatively safe activity as far as the danger of being attacked by a shark is concerned.
6. Is there a month or time when more accidents occur?
Code
# Extract month from the Date column and create a month factor
data$Month <- format(as.Date(data$Date, format = "%Y-%m-%d"), "%B")
data$Month <- factor(data$Month, levels = month.name, ordered = TRUE)

# Count the number of incidents per month
month_counts <- data %>%
  count(Month) %>%
  arrange(Month)

# Plot the number of shark incidents by month
p <- ggplot(month_counts, aes(x = Month, y = n, group = 1)) +
  geom_line(color = "darkred", size = 1) +
  geom_point(color = "darkred", size = 2) +
  labs(title = "Shark Incidents by Month", x = "Month", y = "Number of Incidents") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(color = "Coral", size = 16, face = "bold"),
    axis.title.x = element_text(size = 16, face = "bold"),
    axis.title.y = element_text(size = 16, face = "bold")
  )
ggplotly(p)
According to “Two-year migration of adult female white sharks (Carcharodon carcharias) reveals widely separated nursery areas and conservation concerns”, the white shark mating season occurs in spring or summer, in temperate waters. We conclude that the highest number of incidents occurs at the end of the mating season.
There is an even more important point. We have strong evidence that by far the highest number of incidents occur during surfing, so it is important to know when this activity is practiced. According to information taken from… the month with the best weather for surfing is October. The other activities with a high number of incidents are freediving and swimming, which for reasons of weather and vacations are practiced in summer, from June to October. This is consistent with the monthly series shown above.
Attendance and Per capita Incidents
7. What percentage of people attend the beach per day, and what percentage of beach visitors bathe?
To answer this, we estimate the number of people who attended the beach each day using the estimates from the study “The Human Tide: Beach Attendance and Bathing Rates for Southern California Beaches”. The study reports the following monthly attendance percentages. They were obtained for the year 2007, but we use them to estimate the other years as well, because data availability is a major challenge:
#> Month month_percentage
#> 1 January 0.03855826
#> 2 February 0.03352892
#> 3 March 0.05029338
#> 4 April 0.05867561
#> 5 May 0.07544007
#> 6 June 0.13411567
#> 7 July 0.23470243
#> 8 August 0.20117351
#> 9 September 0.10058676
#> 10 October 0.04526404
#> 11 November 0.03352892
#> 12 December 0.03352892
The study also provides the percentage of beach attendance per day.
#> Month month_percentage month_asis_percentage
#> 1 January 0.26 0.010025147
#> 2 February 0.28 0.009388097
#> 3 March 0.33 0.016596815
#> 4 April 0.31 0.018189438
#> 5 May 0.41 0.030930427
#> 6 June 0.50 0.067057837
#> 7 July 0.52 0.122045264
#> 8 August 0.54 0.108633697
#> 9 September 0.50 0.050293378
#> 10 October 0.36 0.016295054
#> 11 November 0.29 0.009723386
#> 12 December 0.27 0.009052808
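Judging from the numbers, the `month_asis_percentage` column appears to be the product of the monthly attendance share from the first table and the `month_percentage` value in the second table. The sketch below reproduces that arithmetic from the printed values; the variable names `attendance_share`, `monthly_rate` and `combined` are made up for this illustration.

```r
# Monthly attendance shares, copied from the first table (January to December).
attendance_share <- c(0.03855826, 0.03352892, 0.05029338, 0.05867561,
                      0.07544007, 0.13411567, 0.23470243, 0.20117351,
                      0.10058676, 0.04526404, 0.03352892, 0.03352892)

# month_percentage values, copied from the second table.
monthly_rate <- c(0.26, 0.28, 0.33, 0.31, 0.41, 0.50,
                  0.52, 0.54, 0.50, 0.36, 0.29, 0.27)

# Element-wise product, one value per month.
combined <- attendance_share * monthly_rate
data.frame(Month = month.name, combined = combined)
```

For January this gives 0.03855826 × 0.26 ≈ 0.010025, in agreement with the first row of the table.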
The graph above was taken from “Beach Attendance and Bathing Rates for Southern California Beaches”.
Dataset 1 contains information from 1950 to 2022.
Code
summary(data$Year)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1950 1985 2004 1997 2013 2022
Dataset 2 contains information from 1964 to 2023, so we have no information on beach visits before 1964.
Code
summary(data2$Year)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1964 1990 2003 2001 2014 2023
It is also important to note that the two datasets do not cover exactly the same counties, so combining the information is a challenge (as was finding it in the first place, which was very tedious).
Code
# Bar chart of incident records per county in dataset 1, most frequent first
p <- ggplot(data, aes(x = reorder(County, -table(County)[County]))) +
  geom_bar(fill = "steelblue", color = "black") +
  labs(title = "Number of incidents per county", x = "Counties", y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(color = "Coral", size = 16, face = "bold"),
    axis.title.x = element_text(size = 16, face = "bold"),
    axis.title.y = element_text(size = 15, face = "bold")
  )
ggplotly(p)
Code
# Bar chart of records per county in dataset 2, most frequent first
p <- ggplot(data2, aes(x = reorder(County, -table(County)[County]))) +
  geom_bar(fill = "steelblue", color = "black") +
  labs(title = "Counties in dataset 2", x = "Counties", y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(color = "Coral", size = 16, face = "bold"),
    axis.title.x = element_text(size = 16, face = "bold"),
    axis.title.y = element_text(size = 15, face = "bold")
  )
ggplotly(p)
Calculating the number of people who went to each beach per year and per county
Addressing this question is complicated because information on the number of people who attended the beach on the exact days on which incidents occurred is not readily available. Searching for this data day by day is an overwhelming task, so for some counties the question can only be answered with the information we were able to collect, after much effort, in dataset 2.
It is important to note that there is no attendance data for some beaches, especially on some islands. Therefore, incidents per capita will only be studied for places where attendance data is available.
#> # A tibble: 6 × 2
#> County attendance_county
#> <fct> <dbl>
#> 1 Humboldt 0
#> 2 Los Angeles 479493906
#> 3 Marin 3100000
#> 4 Orange 271829516
#> 5 San Diego 598154151
#> 6 San Francisco 990000
We now present the total number of incidents that occurred from 1950 to 2022 (the corresponding graph is shown above).
Code
county_counts <- data %>%
  count(County) %>%
  arrange(desc(n))
We now present the total attendance from 1964 to 2023.
Code
data4
#> # A tibble: 11 × 2
#> County attendance_county
#> <fct> <dbl>
#> 1 Humboldt 0
#> 2 Los Angeles 479493906
#> 3 Marin 3100000
#> 4 Orange 271829516
#> 5 San Diego 598154151
#> 6 San Francisco 990000
#> 7 San Luis Obispo 6173998
#> 8 Santa Barbara 45762839
#> 9 Santa Cruz 23143415
#> 10 Sonoma 817187
#> 11 Ventura 392489
We are now in a position to calculate incidents per capita for each county. We cannot do so for all counties because attendance data is not available for all of them, but we do have data for the most important ones. The document “Beach Attendance and Bathing Rates for Southern California Beaches” estimates that about 45% of beach visitors actively engage in recreational water contact annually. Since it makes no sense to calculate incidents per capita for people who do not go into the sea, attendance is scaled by this 45% before dividing. Humboldt, whose recorded attendance is 0, does not appear in the resulting table.
#> # A tibble: 10 × 5
#> County n attendance_county attendance_contact_w…¹ per_capita_incident
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 san diego 23 598154151 269169368. 0.0000000854
#> 2 santa bar… 19 45762839 20593278. 0.000000923
#> 3 marin 16 3100000 1395000 0.0000115
#> 4 santa cruz 15 23143415 10414537. 0.00000144
#> 5 sonoma 15 817187 367734. 0.0000408
#> 6 san luis … 14 6173998 2778299. 0.00000504
#> 7 los angel… 9 479493906 215772258. 0.0000000417
#> 8 orange 8 271829516 122323282. 0.0000000654
#> 9 san franc… 2 990000 445500 0.00000449
#> 10 ventura 2 392489 176620. 0.0000113
#> # ℹ abbreviated name: ¹attendance_contact_water
As we can see, the incidents per capita are negligible: the highest rate is about 0.0000408 incidents per beach visitor who enters the water (Sonoma), followed by Marin (0.0000115) and Ventura (0.0000113). This achieves the main goal of this research: based on these data, we can claim that it is extremely unlikely for a beachgoer to be involved in a shark incident, let alone killed by a shark.
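The per-capita figures can be reproduced from the incident counts and attendance totals printed above. The sketch below hard-codes those printed values and the 45% water-contact share quoted from the attendance study; the data frame name `rates` and the variable `contact_share` are made up for this illustration, and the full county names are matched to the truncated ones through their attendance totals.

```r
# Incident counts and total attendance per county, copied from the tables above.
rates <- data.frame(
  County = c("san diego", "santa barbara", "marin", "santa cruz", "sonoma",
             "san luis obispo", "los angeles", "orange", "san francisco", "ventura"),
  n = c(23, 19, 16, 15, 15, 14, 9, 8, 2, 2),
  attendance_county = c(598154151, 45762839, 3100000, 23143415, 817187,
                        6173998, 479493906, 271829516, 990000, 392489)
)

# Share of visitors assumed to enter the water (45%, as quoted in the text).
contact_share <- 0.45

rates$attendance_contact_water <- rates$attendance_county * contact_share
rates$per_capita_incident <- rates$n / rates$attendance_contact_water

# Counties ordered from highest to lowest incident rate.
rates[order(-rates$per_capita_incident), c("County", "per_capita_incident")]
```

Sorting by rate rather than by incident count shows that counties with low attendance, such as Sonoma, rank highest even though San Diego has the most incidents.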